Storing the Web in Memory: Space Efficient Language Models with Constant Time Retrieval

Authors

  • David Guthrie
  • Mark Hepple
Abstract

We present three novel methods of compactly storing very large n-gram language models. These methods use substantially less space than all known approaches and allow n-gram probabilities or counts to be retrieved in constant time, at speeds comparable to modern language modeling toolkits. Our basic approach generates an explicit minimal perfect hash function that maps all n-grams in a model to distinct integers to enable storage of associated values. Extensions of this approach exploit distributional characteristics of n-gram data to reduce storage costs, including variable length coding of values and the use of tiered structures that partition the data for more efficient storage. We apply our approach to storing the full Google Web1T n-gram set and all 1-to-5 grams of the Gigaword newswire corpus. For the 1.5 billion n-grams of Gigaword, for example, we can store full count information at a cost of 1.66 bytes per n-gram (around 30% of the cost when using the current state-of-the-art approach), or quantized counts for 1.41 bytes per n-gram. For applications that are tolerant of a certain class of relatively innocuous errors (where unseen n-grams may be accepted as rare n-grams), we can reduce the latter cost to below 1 byte per n-gram.
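The basic idea in the abstract, a minimal perfect hash that maps every stored n-gram to its own slot in a value array, can be illustrated with a small sketch. This is not the authors' construction: it uses a simple hash-and-displace scheme chosen for brevity, and the names `NgramTable`, `_hash`, `FP_BITS`, and the per-slot fingerprint check are assumptions introduced here for illustration. The keys themselves are never stored, which is where the space saving comes from.

```python
import hashlib

_FP_SEED = 2**64 - 1  # hypothetical seed reserved for fingerprints


def _hash(key: str, seed: int, mod: int) -> int:
    """Seeded 64-bit hash of `key`, reduced modulo `mod`."""
    digest = hashlib.blake2b(
        key.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little") % mod


class NgramTable:
    """Toy n-gram store: n keys map to exactly n slots, keys are not kept."""

    FP_BITS = 12  # illustrative fingerprint width, not a value from the paper

    def __init__(self, counts: dict[str, int]):
        if not counts:
            raise ValueError("need at least one n-gram")
        n = self.n = len(counts)
        self.seeds = [0] * n          # one displacement seed per bucket
        self.values = [0] * n         # one count per slot
        self.fingerprints = [0] * n   # one short fingerprint per slot
        buckets = [[] for _ in range(n)]
        for key in counts:
            buckets[_hash(key, 0, n)].append(key)
        occupied = [False] * n
        # Place the largest buckets first, while most slots are still free.
        for b in sorted(range(n), key=lambda i: len(buckets[i]), reverse=True):
            keys = buckets[b]
            if not keys:
                break  # remaining buckets are empty
            seed = 1
            while True:
                # Retry seeds until the whole bucket lands on distinct free slots.
                slots = [_hash(k, seed, n) for k in keys]
                if len(set(slots)) == len(slots) and not any(
                    occupied[s] for s in slots
                ):
                    break
                seed += 1
            self.seeds[b] = seed
            for k, s in zip(keys, slots):
                occupied[s] = True
                self.values[s] = counts[k]
                self.fingerprints[s] = _hash(k, _FP_SEED, 1 << self.FP_BITS)

    def _slot(self, ngram: str) -> int:
        """Two hash evaluations, independent of the table size."""
        return _hash(ngram, self.seeds[_hash(ngram, 0, self.n)], self.n)

    def get(self, ngram: str):
        slot = self._slot(ngram)
        if self.fingerprints[slot] != _hash(ngram, _FP_SEED, 1 << self.FP_BITS):
            return None  # fingerprint mismatch: treat the n-gram as unseen
        return self.values[slot]


counts = {"the cat": 5, "cat sat": 3, "sat on": 7, "on the": 9, "the mat": 2}
table = NgramTable(counts)
print(table.get("the cat"))  # 5
print(table.get("purple elephant"))  # None unless its fingerprint happens to collide
```

In this sketch an unseen n-gram is wrongly accepted only when its fingerprint matches the one stored in the slot it lands on, roughly once in 2^12 lookups at the width chosen above. Narrowing the fingerprint saves space at the price of more such errors, which is one plausible reading of the error-tolerant trade-off the abstract describes.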

Related resources

Improved Skips for Faster Postings List Intersection

Information retrieval can be achieved through computerized processes by generating a list of relevant responses to a query. The document processor, matching function, and query analyzer are the main components of an information retrieval system. Document retrieval systems are fundamentally based on Boolean, vector-space, probabilistic, and language models. In this paper, a new methodology for mat...

Efficient Delay Characterization Method to Obtain the Output Waveform of Logic Gates Considering Glitches

Accurate delay calculation of circuit gates is very important in timing analysis of digital circuits. Waveform shapes on the input ports of logic gates should be considered, in the characterization phase of delay calculation, to obtain accurate gate delay values. Glitches and their temporal effect on circuit gate delays should be taken into account for this purpose. However, the explosive numbe...

A Survey of Medical Records Filing in the Teaching Hospitals of Khorramabad in 1387 (Solar Hijri)

The results of the thoughts, activities, and measures taken in each organization are maintained as documents and records. These records, which are created at great cost in time and money, contain valuable information and experience that play an important role in promoting the goals of the organization. Therefore, medical records filing can be called the memory of the organization, which weakness in ...

The technique of in-place associative sorting

First, a novel yet straightforward in-place integer value-sorting algorithm is presented. It sorts in linear time using a constant amount of additional memory for storing counters and indices besides the input array. The technique is inspired by the principal idea behind one of the ordinal theories of “serial order in behavior” and is explained by analogy with the three main stage...

Publication date: 2010